Consider aus_arrivals, the quarterly number of international visitors to Australia from several countries for the period 1981 Q1 – 2012 Q3.
- Select one country and describe the time plot.
aus_arrivals |>
filter(Origin == "Japan") |>
autoplot(Arrivals)
Arrivals from Japan increased rapidly until the mid-1990s and have gradually declined since, with a strong quarterly seasonal pattern throughout.
- Use differencing to obtain stationary data.
aus_arrivals |>
filter(Origin == "Japan") |>
gg_tsdisplay(Arrivals |> difference(lag=4) |> difference(), plot_type = "partial")
- What can you learn from the ACF graph of the differenced data?
- What can you learn from the PACF graph of the differenced data?
- What model do these graphs suggest?
Both graphs show significant spikes at lags 1 and 4. Reading the ACF spikes as moving average signatures suggests a non-seasonal MA(1) and a seasonal MA(1) component, so an ARIMA(0,1,1)(0,1,1)[4] model is a reasonable first guess.
- Does ARIMA() give the same model that you chose? If not, which model do you think is better?
aus_arrivals |>
filter(Origin == "Japan") |>
model(ARIMA(Arrivals)) |>
report()
## Series: Arrivals
## Model: ARIMA(0,1,1)(1,1,1)[4]
##
## Coefficients:
## ma1 sar1 sma1
## -0.4228 -0.2305 -0.5267
## s.e. 0.0944 0.1337 0.1246
##
## sigma^2 estimated as 174801727: log likelihood=-1330.66
## AIC=2669.32 AICc=2669.66 BIC=2680.54
The resulting model has an additional seasonal AR(1) component compared to what I guessed. We can compare the two models based on the AICc statistic:
aus_arrivals |>
filter(Origin == "Japan") |>
model(
guess = ARIMA(Arrivals ~ pdq(0,1,1) + PDQ(0,1,1)),
auto = ARIMA(Arrivals)
) |>
glance()
## # A tibble: 2 × 9
## Origin .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 Japan guess 177223035. -1332. 2670. 2670. 2678. <cpl [0]> <cpl [5]>
## 2 Japan auto 174801727. -1331. 2669. 2670. 2681. <cpl [4]> <cpl [5]>
The automatic model is only slightly better than my guess based on the AICc statistic.
- Write the model in terms of the backshift operator, then without using the backshift operator.
\[ (1-B)(1-B^4)(1-\Phi B^4)y_t = (1+\theta B)(1+\Theta B^4) \varepsilon_t \] \[ \left[1-B - (1 + \Phi)B^4 + (1 + \Phi) B^5 + \Phi B^8 - \Phi B^9\right]y_t = (1+\theta B + \Theta B^4 + \theta\Theta B^5) \varepsilon_t \]
\[ y_t - y_{t-1} - (1 + \Phi)y_{t-4} + (1 + \Phi) y_{t-5} + \Phi y_{t-8} - \Phi y_{t-9} = \varepsilon_t + \theta \varepsilon_{t-1} + \Theta \varepsilon_{t-4} + \theta\Theta \varepsilon_{t-5}. \]
\[ y_t = y_{t-1} + (1 + \Phi)y_{t-4} - (1 + \Phi) y_{t-5} - \Phi y_{t-8} + \Phi y_{t-9} + \varepsilon_t + \theta \varepsilon_{t-1} + \Theta \varepsilon_{t-4} + \theta\Theta \varepsilon_{t-5}. \]
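Substituting the estimated coefficients from the report above (\(\theta = -0.4228\), \(\Phi = -0.2305\), \(\Theta = -0.5267\), so \(1+\Phi = 0.7695\) and \(\theta\Theta \approx 0.2227\)) gives the fitted model numerically:
\[ y_t = y_{t-1} + 0.7695\, y_{t-4} - 0.7695\, y_{t-5} + 0.2305\, y_{t-8} - 0.2305\, y_{t-9} + \varepsilon_t - 0.4228\,\varepsilon_{t-1} - 0.5267\,\varepsilon_{t-4} + 0.2227\,\varepsilon_{t-5}. \]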
Choose a series from us_employment, the total employment in different industries in the United States.
- Produce an STL decomposition of the data and describe the trend and seasonality.
leisure <- us_employment |>
filter(Title == "Leisure and Hospitality")
leisure |>
autoplot(Employed)
leisure |>
model(STL(sqrt(Employed) ~ season(window=7))) |>
components() |>
autoplot()
- Do the data need transforming? If so, find a suitable transformation.
Yes. A square root transformation works reasonably well: the remainder series from the STL decomposition is relatively homoscedastic. With no transformation, or with a log transformation, the remainder series appeared heteroscedastic.
leisure |> features(Employed, guerrero)
## # A tibble: 1 × 2
## Series_ID lambda_guerrero
## <chr> <dbl>
## 1 CEU7000000001 -0.216
- Are the data stationary? If not, find an appropriate differencing which yields stationary data.
leisure |>
autoplot(sqrt(Employed) |> difference(lag=12) |> difference())
The double-differenced square-root data are close to stationary, although the variance has decreased over time.
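As a cross-check (not part of the original workings), the unit-root features from the feasts package can confirm the differencing orders; a minimal sketch:

```r
# Suggested number of seasonal differences for the square-root data
leisure |>
  features(sqrt(Employed), unitroot_nsdiffs)

# Suggested number of first differences after seasonal differencing
leisure |>
  features(difference(sqrt(Employed), 12), unitroot_ndiffs)
```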
- Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AICc values?
leisure |>
gg_tsdisplay(sqrt(Employed) |> difference(lag=12) |> difference(), plot_type="partial")
fit <- leisure |>
model(
arima210011 = ARIMA(sqrt(Employed) ~ pdq(2,1,0) + PDQ(0,1,1)),
arima012011 = ARIMA(sqrt(Employed) ~ pdq(0,1,2) + PDQ(0,1,1))
)
glance(fit)
## # A tibble: 2 × 9
## Series_ID .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 CEU7000000001 arima210011 0.0380 207. -406. -406. -386. <cpl [2]> <cpl>
## 2 CEU7000000001 arima012011 0.0381 206. -404. -404. -384. <cpl [0]> <cpl>
The ARIMA(2,1,0)(0,1,1)[12] model is slightly better according to the AICc.
- Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
fit |>
select(arima210011) |>
gg_tsresiduals()
The tails of the residual distribution are too long, and there is significant autocorrelation at lag 11, as well as some smaller significant spikes elsewhere.
fit <- leisure |>
model(
arima210011 = ARIMA(sqrt(Employed) ~ pdq(2,1,0) + PDQ(0,1,1)),
arima012011 = ARIMA(sqrt(Employed) ~ pdq(0,1,2) + PDQ(0,1,1)),
auto = ARIMA(sqrt(Employed))
)
glance(fit)
## # A tibble: 3 × 9
## Series_ID .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 CEU7000000001 arima210011 0.0380 207. -406. -406. -386. <cpl [2]> <cpl>
## 2 CEU7000000001 arima012011 0.0381 206. -404. -404. -384. <cpl [0]> <cpl>
## 3 CEU7000000001 auto 0.0365 226. -440. -440. -411. <cpl [2]> <cpl>
fit |> select(auto) |> report()
## Series: Employed
## Model: ARIMA(2,1,2)(0,1,1)[12]
## Transformation: sqrt(Employed)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## 1.6261 -0.9132 -1.4773 0.7937 -0.5443
## s.e. 0.0400 0.0309 0.0535 0.0352 0.0340
##
## sigma^2 estimated as 0.03655: log likelihood=226.22
## AIC=-440.44 AICc=-440.35 BIC=-411.26
The automatically selected ARIMA(2,1,2)(0,1,1)[12] model is better than either of my selections.
fit |>
select(auto) |>
gg_tsresiduals()
The residuals look better, although there is still a significant spike at lag 11.
- Forecast the next 3 years of data. Get the latest figures from https://fred.stlouisfed.org/categories/11 to check the accuracy of your forecasts.
fc <- fit |>
forecast(h = "3 years")
fc |>
filter(.model=="auto") |>
autoplot(us_employment |> filter(year(Month) > 2000))
update <- readr::read_csv("data/CEU7000000001.csv") |>
mutate(
Month = yearmonth(DATE),
Employed = CEU7000000001
) |>
select(Month, Employed) |>
as_tsibble(index=Month) |>
filter(Month >= min(fc$Month))
fc |> accuracy(update)
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima012011 Test -2887. 3426. 2887. -22.3 22.3 NaN NaN 0.706
## 2 arima210011 Test -2885. 3425. 2885. -22.3 22.3 NaN NaN 0.706
## 3 auto Test -2931. 3463. 2931. -22.7 22.7 NaN NaN 0.705
fc |>
filter(.model=="auto") |>
autoplot(us_employment |> filter(year(Month) > 2000)) +
geom_line(data=update, aes(x=Month, y=Employed), col='red')
The initial forecasts look great, but the COVID-19 pandemic then had a huge impact on employment in this industry.
- Eventually, the prediction intervals are so wide that the forecasts are not particularly useful. How many years of forecasts do you think are sufficiently accurate to be usable?
Given the pandemic, only about 5 months. In more normal times, perhaps 2–3 years.
Choose one of the following seasonal time series: the Australian production of electricity, cement, or gas (from aus_production).
- Do the data need transforming? If so, find a suitable transformation.
aus_production |>
autoplot(Electricity)
Yes, the variation increases with the level of the series, so the data need transforming.
lambda <- aus_production |>
features(Electricity, guerrero) |>
pull(lambda_guerrero)
aus_production |>
autoplot(box_cox(Electricity, lambda))
guerrero() suggests using a Box-Cox transformation with parameter \(\lambda = 0.52\).
- Are the data stationary? If not, find an appropriate differencing which yields stationary data.
The trend and seasonality show that the data are not stationary.
aus_production |>
gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4), plot_type = "partial")
aus_production |>
gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4) |> difference(1), plot_type = "partial")
- Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AIC values?
From the above graphs, an AR(1) or an MA(1) together with a seasonal MA(2) might work: for example, an ARIMA(1,1,0)(0,1,2)[4] model for the transformed data.
fit <- aus_production |>
model(
manual = ARIMA(box_cox(Electricity, lambda) ~ 0 + pdq(1, 1, 0) + PDQ(0, 1, 2)),
auto = ARIMA(box_cox(Electricity, lambda))
)
fit |>
select(auto) |>
report()
## Series: Electricity
## Model: ARIMA(1,1,4)(0,1,1)[4]
## Transformation: box_cox(Electricity, lambda)
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sma1
## -0.7030 0.2430 -0.4477 -0.1553 -0.2452 -0.5574
## s.e. 0.1739 0.1943 0.0973 0.0931 0.1189 0.1087
##
## sigma^2 estimated as 18.02: log likelihood=-609.08
## AIC=1232.17 AICc=1232.71 BIC=1255.7
glance(fit)
## # A tibble: 2 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 manual 19.7 -620. 1247. 1248. 1261. <cpl [1]> <cpl [8]>
## 2 auto 18.0 -609. 1232. 1233. 1256. <cpl [1]> <cpl [8]>
Automatic model selection with ARIMA() has also taken a first-order difference, so we can compare the AICc values. This is a challenging ARIMA model to select manually, and the automatic model is clearly better.
- Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
fit |>
select(auto) |>
gg_tsresiduals()
# count the estimated coefficients to find the dof for the Ljung-Box test
fit |> select(auto) |> report()
## Series: Electricity
## Model: ARIMA(1,1,4)(0,1,1)[4]
## Transformation: box_cox(Electricity, lambda)
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sma1
## -0.7030 0.2430 -0.4477 -0.1553 -0.2452 -0.5574
## s.e. 0.1739 0.1943 0.0973 0.0931 0.1189 0.1087
##
## sigma^2 estimated as 18.02: log likelihood=-609.08
## AIC=1232.17 AICc=1232.71 BIC=1255.7
tidy(fit)
## # A tibble: 9 × 6
## .model term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 manual ar1 -0.346 0.0648 -5.34 2.33e- 7
## 2 manual sma1 -0.731 0.0876 -8.34 9.16e-15
## 3 manual sma2 0.0820 0.0858 0.956 3.40e- 1
## 4 auto ar1 -0.703 0.174 -4.04 7.40e- 5
## 5 auto ma1 0.243 0.194 1.25 2.13e- 1
## 6 auto ma2 -0.448 0.0973 -4.60 7.15e- 6
## 7 auto ma3 -0.155 0.0931 -1.67 9.68e- 2
## 8 auto ma4 -0.245 0.119 -2.06 4.04e- 2
## 9 auto sma1 -0.557 0.109 -5.13 6.52e- 7
fit |>
select(auto) |>
augment() |>
features(.innov, ljung_box, dof = 6, lag = 12)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 auto 8.45 0.207
The residuals look reasonable; they resemble white noise.
- Forecast the next 24 months of data using your preferred model.
fit |>
select(auto) |>
forecast(h = "2 years") |>
autoplot(aus_production)
- Compare the forecasts obtained using ETS().
aus_production |>
model(ETS(Electricity)) |>
forecast(h = "2 years") |>
autoplot(aus_production)
aus_production |>
model(ETS(Electricity)) |>
forecast(h = "2 years") |>
autoplot(aus_production |> filter(year(Quarter) >= 2000)) +
autolayer(fit |> select(auto) |> forecast(h = "2 years"), colour = "red", alpha = 0.4)
For the same time series you used in the previous exercise, try using a non-seasonal model applied to the seasonally adjusted data obtained from STL. Compare the forecasts with those obtained in the previous exercise. Which do you think is the best approach?
lambda <- aus_production |>
features(Electricity, guerrero) |>
pull(lambda_guerrero)
stlarima <- decomposition_model(
STL(box_cox(Electricity, lambda)),
ARIMA(season_adjust)
)
fc <- aus_production |>
model(
ets = ETS(Electricity),
arima = ARIMA(box_cox(Electricity, lambda)),
stlarima = stlarima
) |>
forecast(h = "2 years")
fc |> autoplot(aus_production |> filter(year(Quarter) > 2000), level=95)
The STL-ARIMA approach has higher values and narrower prediction intervals. It is hard to know which is best without comparing against a test set.
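One way to make that comparison (a sketch that is not part of the original answer; the 2008 Q1 training cutoff is an arbitrary choice) is to hold out the final years of aus_production as a test set and compare forecast accuracy:

```r
# Hold out everything from 2008 Q1 onwards as a test set
train <- aus_production |>
  filter(Quarter < yearquarter("2008 Q1"))

train |>
  model(
    ets = ETS(Electricity),
    arima = ARIMA(box_cox(Electricity, lambda)),
    stlarima = stlarima
  ) |>
  forecast(h = "2 years") |>
  accuracy(aus_production)  # compare test-set RMSE/MAE across models
```

This reuses the lambda and stlarima objects defined above; strictly, lambda should be re-estimated on the training data only.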
For the Australian tourism data (from tourism):
- Fit a suitable ARIMA model for all data.
- Produce forecasts of your fitted models.
- Check the forecasts for the “Snowy Mountains” and “Melbourne” regions. Do they look reasonable?
fit <- tourism |>
model(arima = ARIMA(Trips))
fc <- fit |> forecast(h="3 years")
fc |>
filter(Region == "Snowy Mountains") |>
autoplot(tourism)
fc |>
filter(Region == "Melbourne") |>
autoplot(tourism)
Both sets of forecasts appear to have captured the underlying trend and seasonality effectively.